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Abstract: 

We derive an information criterion to select a parametric model of complete-data distri¬ 
bution when only incomplete or partially observed data is available. Compared with AIC, 
our new criterion has an additional penalty term for missing data, which is expressed 
by the Fisher information matrices of complete data and incomplete data. We prove 
that our criterion is an asymptotically unbiased estimator of complete-data divergence, 
namely, the expected Kullback-Leibler divergence between the true distribution and the 
estimated distribution for complete data, whereas AIC is that for the incomplete data. In¬ 
formation criteria PDIO (Shimodaira 1994) and AICcd (Cavanaugh and Shumway 1998) 
have been previously proposed to estimate complete-data divergence, and they have the 
same penalty term. The additional penalty term of our criterion for missing data turns 
out to be only half the value of that in PDIO and AICcd. The difference in the penalty 
term is attributed to the fact that our criterion is derived under a weaker assumption. A 
simulation study with the weaker assumption shows that our criterion is unbiased while 
the other two criteria are biased. In addition, we review the geometrical view of alternat¬ 
ing minimizations of the EM algorithm. This geometrical view plays an important role in 
deriving our new criterion. 

Keywords and phrases: Akaike information criterion. Alternating projections. Data 
manifold, EM algorithm, Fisher information matrix. Incomplete data, Kullback-Leibler 
divergence, Misspecihcation, Takeuchi information criterion. 


1. Introduction 

Modeling complete data X = {Y, Z) is often preferable to modeling incomplete or partially 
observed data Y when missing data Z is not observed. The expectation-maximization (EM) 
algorithm (Dempster, Laird and Rubin, 1977) computes the maximum likelihood estimate of 
parameter vector 0 for a parametric model of the probability distribution of X. In this re- 
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search, we consider the problem of model selection in snch sitnations. For mathematical sim¬ 
plicity, we assnme that X consists of independent and identically distribnted random vec¬ 
tors. More specihcally, X = {xi,X2,... ,Xn), and the complete-data distribntion is modeled 
as Xi,X2,... ,Xn ~ Px{x; 0 ). Each vector is decomposed as x"^ = and the marginal 

distribntion is expressed as Py{y, 6) = J Px{y, Z] 6) dz, where T denotes the matrix transpose 
and the integration is over all possible valnes of z. We formally treat y,z as continnous random 
variables with the joint density fnnction p^. However, when they are discrete random variables, 
the integration shonld be replaced with a snmmation of the probability fnnctions. We use sym¬ 
bols such as Px and Py for both the continuous and discrete cases, and simply refer to them as 
distributions. 

The log-likelihood function is (-yiO) = with the parameter vector 6 = 

( 6 ^ 1 ,..., Od)'^ G M'^. We assume that the model is identihable, and the parameter is restricted to 
0 C Then the maximum likelihood estimator (MLE) of 6 is dehned by Oy = argmax iy{6)- 

The dependence of iy{0) and 6y on Y = {yi,... ,yn) is suppressed in the notation. Akaike 
(1974) proposed the information criterion 

AIC = -2iy{ey) + 2d 

for model selection. The hrst term measures the goodness of £t, whereas the second term is 
interpreted as a penalty for model complexity. The AIC values for candidate models are com¬ 
puted, and then the model that minimizes AIC is selected. This information criterion estimates 
the expected discrepancy between the unknown true distribution of y, which is denoted as 
and the estimated distribution py{6y). This discrepancy is measured by the incomplete-data 
Kullback-Leibler divergence. 

In this study, we work on the complete-data Kullback-Leibler divergence instead of the 
incomplete-data counterpart. An information criterion to estimate the expected discrepancy 
between the unknown true distribution of x, which is denoted as and the estimated dis¬ 
tribution Px{0y) is derived. This approach makes sense when modeling complete data more 
precisely describes the part being examined. Similar attempts are found in the literature. Shi- 
modaira (1994) proposed the information criterion PDIO (predictive divergence for incomplete 
observation models) 

PDIO = -2iy{6y) +2tT{lxiey)ly{ey)-^). 

The two matrices in the penalty term are the Fisher information matrices for complete data 
and incomplete data. They are dehned by 

Ix{.d) — j Pxix,6) QQQQT 
Iy{e) = -J Py{y-0) -- dy. 
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Let pz\y{z\y, 6) = Pxiy, Z] 6)/py{y; 6) be the conditional distribution of z given and Iz\y{G) = 
lx{0) — Iy{G) be the Fisher information matrix for pz\y. Since lz\y{0) is nonnegative dehnite, 
we have tr(4(0)4(0)^^) = tr((4(0) + I^\y{6))Iy{6)~'^) = d + tr(4|y(0)4(0)"^) > d. Thus, the 
nonnegative difference 

PDIO - AIC = 2tr(4|^(4)4(4)-') 

is interpreted as the additional penalty for missing data. There are similar attempts in the 
literature (Cavanaugh and Shumway, 1998; Seghouane, Bekara and Fleury, 2005; Claeskens and 
Consentino, 2008; Yamazaki, 2014). In particular, Cavanaugh and Shumway (1998) proposed 
another information criterion 

AlCcd = —2Q{6y] 6y) + 2 tr(4(0y)4(^J/) ^) 

by replacing iy{0y) in PDIO with Q{0y] 6y) to measure the goodness of £t. It should be noted 
that cd stands for complete data. This is the function introduced in Dempster, Laird and Rubin 
(1977) for the EM algorithm, and is dehned by 

n p 

Q{e2-,6i) = '^ / p^iy{z\yt, 6 i)\ogpx{yt,z-, 62 )dz. 

t=i 

We recently found that the assumption in Shimodaira (1994) to derive PDIO is unnecessarily 
strong. Additionally, the same assumption explains the derivation of AICcd- In this paper, we 
derive a new information criterion under a weaker assumption. The updated version of PDIO 
is 

AICa;;j, = —2f.y{0y) + (7 + tl (^Ix {O y) Iy {0 y) ^). 

The hrst suffix x indicates that a random variable is used to measure the discrepancy, while 
the second suffix y indicates a random variable is used for the observation. Then the additional 
penalty for missing data becomes 

AIC,;, - AIC = tr(/.|,(e,)4(0,)-'). (1) 

The additional penalty is only half the value of that in PDIO. In practice, the computation of 
AICj;;y as well as the related criteria PDIO and AICcd is not very difficult. The SEM algorithm 
of Meng and Rubin (1991) provides a shortcut to compute the penalty term ti{Ix{6y)Iy{0y)~^) 
without computing the two Fisher information matrices as described in Shimodaira (1994) and 
Cavanaugh and Shumway (1998). 

To derive AlC,x-,yi we hrst review the basic properties of Kullback-Leibler divergence for 
incomplete data in Section 2. Section 3 considers those for complete data. Although these 
results are not new, they are crucial for the argument in later sections. In particular, the 
geometrical view of alternating minimizations (Csiszar and Tusnady, 1984; Amari, 1995) in 
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Section 3.3 is important to understand why the goodness of £t term of AlCa:;^ is expressed by 


the incomplete-data likelihood function instead of the complete-data counterpart. 


Section 4, which begins the argument of model selection, discusses what the information crite¬ 
ria should estimate. In general, parametric models are misspecihed, and we do not assume that 
the true distribution is expressed as = Pxi^o) using the “true” parameter value 6q. However, 
the unbiasedness of P^lCx-y is based on the assumption that Pz\y{G) is correctly specihed for qz\y. 
In Section 5, we derive our new information criterion. The argument is very straightforward; it 
simply follows the argument for the robust version of AIC, which is also known as the Takeuchi 
information criterion (TIC) that is described in Burnham and Anderson (2002) and Konishi 
and Kitagawa (2008). Section 6 compares the assumptions used to derive PDIO and AICcd to 
those of PACx-y Section 7 presents a simulation study to verify the theory. Finally, Section 8 
contains some concluding remarks. Proofs are deferred to the Appendix. 

2. Incomplete-data divergence 

Here we review Kullback-Leibler divergence and the asymptotic distribution of MLE under 
model misspecihcation (White, 1982). Let Qy and fy be the arbitrary probability distributions 
of incomplete data. The incomplete-data Kullback-Leibler divergence from gy to fy is 


^yidy'Jy) = “ / 9yiv) (}og fy{y) - log gy{y))dy, 


where Dy{gy-, fy) > 0 and the equality holds for gy = fy (Csiszar, 1975; Amari and Nagaoka, 
2007). The cross-entropy is 



and the entropy is Ly{gy) = Ly{gy]gy). Instead of minimizing Dy{gy] fy) = Ly{gy] fy) - Ly{gy) 
with respect to fy, we minimize Ly{gy-, fy), because Ly{gy) is independent of fy. 

For the true distribution qy and the parametric model Py{0), we consider the minimization 
of Dy{qy-,py{0)) wFli rospoct to 6. The optimal parameter value is dehned by 


Oy = argmin Ly{qy-,py{6)). 


This minimization is interpreted geometrically as a “projection” of qy to the model manifold 
Myipy) as illustrated in Fig. 1 (a). Let My{py) = {py{6) : G 0} be the set of Py{6) with all 

possible parameter values. Then the projection is defined as 
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The projection Py{Oy) is the best approximation of qy in My{py) when the discrepancy is mea- 
snred by the Knllback-Leibler divergence. We assnme that the parametric model is generally 
misspecified and qy ^ My{py). Later, we also consider the situation where the parametric model 
is correctly specified and qy G My{py). In the correctly specified case, 6y is the true parameter 
value in the sense that qy = py{6y). 

Similar to the optimal parameter value, the maximum likelihood estimator is interpreted as a 
projection of qy to My{py). Let qyifu) = ^ XltLi ^iy~yt) be the empirical distribution of y for the 
observed incomplete data yi,... ,yn. Here 5(-) denotes the Dirac delta function for continuous 
random variables, or is simply the indicator function for discrete random variables such that 
5{y—yt) = Hot y = yt an.(i5{y—yt) = 0 otherwise. Then we can write = —nLy{qy]Py{6)). 
Thus, 

Oy = argmin Ly{qy-py{e)). (3) 

C/GW 

We assume the regularity conditions of White (1982) for consistency and asymptotic normal¬ 
ity of Oy . More specifically, we assume all the regularity conditions (Al) to (A6) for the true 
distribution qy and the model distribution Py{6). In particular, 6y is determined uniquely (i.e., 
identifiable) and is interior to the parameter space 0. We assume that ly{0), Gy{qy]0) and 
Hy{qy-, 0) defined below are nonsingular in the neighborhood of Oy. Then White (1982) showed 
that, as n —)■ oo asymptotically, Oy Oy and 

(e, - e,) 4 Af(0, HpGyHp). (4) 

The matrices are defined as Gy = Gy{qy; Oy) and Hy = Hy{qy; Oy), where 

n ( Q\ f f ^d^oSPyiy'^0)d\ogpy{y;0) 

Gy{gy]0) = J gy{y) - — -- dy, 

Hy{9y,e) = - J gy{y) -- dy. 

In the case of the correct specification qy = py{0y), the matrices become Gy = Hy = lyifOy). 

3. Complete-data divergence 

Here we review Knllback-Leibler divergence for complete data when only incomplete data can 
be observed (Csiszar and Tusnady, 1984; Amari, 1995). 

3.1. Projection to the model manifold 

Let gx and fx be the arbitrary probability distributions of complete data. The complete-data 
Knllback-Leibler divergence from px to fx is 

Dx{gx]fx) = - gxix){\og fxix) -\oggxix))dx. 




H. SHIMODAIRA AND H. MAEDA/An information criterion for incomplete data 


6 




Fig 1. (a) Space of incomplete-data probability distributions. Projection from qy to the model manifold My{py) 
(arrow with a solid line), and that from qy (arrow with a broken line) using eqs. (2) and (3) in Section 2, 
respectively. The dotted line indicates Dy{qy-,py{6y)), which is the loss funetion for Asky^y. (b) Space of complete- 
data probability distributions. Projection from qx to the model manifold Mx(px) using eq. (5) in Section 3.1. 
Projection from px{0) to the data manifold Sx{qy) using eq. (9) in Section 3.2. Alternating projections between 
the two manifolds using eq. (10) in Section 3.3. The dotted line indicates Dxiqx]Px{dy)), which is the loss 
function for Askx-^y. The bold segment indicates Dx{qx]Pz\y{dy)9v)> which is assumed to be zero in (15). 


All the arguments of incomplete data in Section 2 apply to complete data by replacing y with 
X in the notation. For example, we write Dx{gx] fx) = Lx{gx', fx) — Lx{gx) with Lx{gx] fx) = 
— J gx{x)logfx{x)dx and Lx{gx) = Lx{gx]gx)- The projection of Qx to the model manifold 
Mx{px) = {Px{0) : V 0 G 0 } is dehned as 

min Dx{qx] fx) = Dx{qx]Px{Ox)) (5) 

fx&^x{px) 

with Ox = argmin Lx{qx',Px{0))- Figure 1 (b) shows a geometric illustration. Note that Ox 7 ^ Oy 

0 

and Px{0x) 7 ^ Px{0y) in general. 

3.2. Projection to the data manifold 

The following simple lemma helps understand how the incomplete-data divergence and the 
complete-data divergence are related. 

Lemma 1. For two distributions gx{x) and fx{x), we have 


Dx{gx-J fx) Dxidxi fz\ygy) + fy)i 


( 6 ) 
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where fz\ygy represents the distribution fz\y{z\y)gy{y). Therefore, the difference of the two di¬ 
vergences is Dz,{gx] fx)-Dy{gy] fy) = Dx{gx] fziyQy), which is zero if gz\y = fz\y For an arbitrary 
distribution hx{x), the last term in (6) is expressed as 

Fyidyi fy) ~ F)x{hz\ygy, hz\yfy)- (7) 

In particular, choosing hx = fx gives Dy{gy] fy) = Dx{fz\ygy]fx), and 

Fx{gxj fx) Dx{gx^ fzlydy) “ 1 “ Dx{fz\ygy: fx)- ( 8 ) 

We consider the set of all probability distributions gx with the same marginal distribution 
gy = Qy for a specihed Qy. This set is denoted as Sx{qy) = {gziyQy ■ '^gz\y}- Note that the 
elements of Sx{qy) are written as gz\yqy with arbitrary gz\y because J gz\y{z\y)qy{y) dz = qy{y). 
Equations (88) and (57) in Amari (1995) are Sx{qy) and its restriction to a hnite dimensional 
model, respectively, and are called the observed data (sub)manifold there. Here, we call Sx{qy) 
the expected data manifold and Sxify) the observed data manifold, although it may be abuse 
of the word “manifold” for subsets with inhnite dimensions. 

The projection of Px{0) to Sx{qy) should be dehned to minimize the complete-data divergence 
over Sx{qy), but the roles of gx and fx in Dx{gx'i fx) are exchanged from those of (5). We minimize 
Dx{gx;Px{0)) over gx € Sx{qy). By letting gx G Sx{qy) and fx = Px{0) in (6), 

Dx{gx-I Px{S)) — Dx{gz\yqy, Pz\y{S)qy) F Dy[qy, Py[0)), 
which is minimized when gz\y = Pz\y{0)- Therefore, the projection gives the minimum value as 

min Dx{gx;PxiO)) = Dy{qy;py{0)). (9) 

9x&Sa:(qy) 

Using (8), the minimum value can also be written as Dy{qy]Py{0)) = Dx{pz\y{0)qy]Px{G)). 


3.3. Alternating projections between the two manifolds 


The optimal parameter 0y of the incomplete data is interpreted as a dual or alternate mini¬ 
mization problem of complete-data divergence. By minimizing (9) over 0 G 0, we dehne the 
alternating projections between Sx{qy) and Mx{px) as 


min min 

fx^^xi^Px) 9x^Sx{Qy) 


Dx{,gx) fx) 


Dy{qy] Py{0y)), 


( 10 ) 


where the minimum is attained by gx = Pz\y{0y)qy and fx = Px{Sy). See eq. (65) in Amari 
(1995). This implies that pz\y{0y)qy is the best approximation of qx when the two manifolds 
Sx{qy) and MxifPx) are known, while Px{0y) is the best approximation of qx in MxifPx)- This 
interpretation is the key to understanding our problem. 
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The above mentioned geometrical interpretation corresponds to the well known fact that 

the EM algorithm of Dempster, Laird and Rubin (1977) is alternating projections between 

Sx{%) and Mx{px)- See Csiszar and Tusnady (1984), Byrne (1992), Amari (1995), and Ip and 

Lalwani (2000). Starting from the initial value the EM algorithm computes a sequence of 

the parameter values s = 1, 2,...} by the updating formula = argmax QiO] 

0e© 

It follows from L^iPz\yi0i)qy]p^(e2)) = -Q{02]0i)/n that 

= argmin La,{pz\y{O^''^)qy-,Px{0)), 

meaning pa.(0(*+bj jg projection from pz\y{0^^'>)qy to Mx{px)- Alternatively, pz\y{6^^'^)qy is 
the projection from px{0^^'^) to Sx{qy)- Thus, the converging point of the alternating projections 
satisfies 

0y = argmin Lx{pz\y{0y)qy-,Px{0))- (11) 

4. Risk functions for model selection 

By looking at the incomplete-data distributions, the discrepancy between the true distribution 
qy and our estimation Py{0y) is measured by the incomplete-data divergence Dy{qy]py{0y)). If 
we take it as the loss function, the expected loss-function, or the risk function, will measure 

the discrepancy in the long run. Then AIC and its variants are derived as estimators of 

lis]^y,y = E{Dy{qy]Py{0y))}. (l2) 

The expectation is evaluated with respect to although it involves only qy here. This is the 
standard approach in the literature (Akaike, 1974; Bozdogan, 1987; Burnham and Anderson, 
2002; Konishi and Kitagawa, 2008). 

Shimodaira (1994) and Cavanaugh and Shumway (1998) proposed another approach, which 
employs the complete-data divergence Dx{qx]Px{0y)) to measure the discrepancy between the 
complete-data distributions qx and Px{0y)- Using the complete-data divergence as the loss func¬ 
tion, the risk function becomes 

riskj-.y 77 (g^., Pj. (0y)) J-. (13) 

The first suffix x indicates the random variable for the loss function, while the second suffix y 
indicates the random variable for the observation. 

However, estimating (13) is difficult. The complete-data empirical distribution qx{x) = 
^ <^(* ~ *t) is unknown; we only know that qx is somewhere in the observed data manifold 

Sx{Qy)- Considering the limiting situation of n —)■ 00 , we may only know that the true distri¬ 
bution is somewhere in the expected data manifold: qx G Sx{qy). Then the best substitute for 


qx IS 


Qx — Pz\y{0y)qy 


( 14 ) 
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as suggested by (10) from the viewpoint of the alternating projections in Section 3.3. To estimate 
(13), we assume that (14) holds in this paper. This assumption is rephrased as 

Pz\yi,^y)Qy} 0 


or equivalently 

Qz\y — Pz\y{^y)y (15) 

implying that Pz\y{0) is correctly specified for q^iy and that 0^ = Oy, because the two projections 
from Qx and Pz\y{Gy)<iy to M^ipx) become identical as illustrated in Fig. 1 (b). Because it is 
impossible to know how much qz\y actually deviates from Pz\y{6y) when Z = ( 2 i,...,z„) is 
missing completely, we assume (15) in the following argument to derive AICx;j/. Note that 
assumption (15) holds with Ox = Oy = 0 q in the case of the correct specification where qx = 
PxiOo)- 

We are now ready to derive AlCa;;^ as an estimator of 2nhskx-y. The arguments in Lemma 2 
and Theorem 1 almost duplicate that used to derive TIC mentioned in Burnham and Anderson 
(2002) and Konishi and Kitagawa (2008). However, it should be noted that in Lemma 2 the 
first term of riskj-.y is expressed by the incomplete-data divergence instead of the complete-data 
divergence. A point for proving the lemma is that 

DxifixiPxiOy)^ Dx{(lxiPz\y{Oy^qy) -\- Dy{ciy,Py{Oy^^ D y{^y, P y{0 y')^ , (16) 

which follows from Lemma 1 and the assumption (15). Dx{qx]Px{0y)) on the left-hand side 
is the amount of misspecification of Px{0), and can be decomposed into the two parts: 
Dx{qx', Pz\y{0y)qy) and Dy{qy;py{0y)), which are the contribution of Pz\y{0) and Py{0), respec¬ 
tively. To estimate (13), instead of estimating Dx{qx]Pz\y{Oy)qy), we ignore it. 

Lemma 2. Assume the regularity conditions of White (1982) mentioned in Section 2, and also 
assume that (15) holds. Then the expected loss is asymptotically expanded as 

risk^;j, = Dy{qy;py{0y)) + ^ tT{HxH-^GyH-^) + 0(n“^/2). (17) 

The matrices Gy and Hy are those defined in Section 2, and = Hx{pz\y{0y)qy-,0y) with 

HxidxjO) J gx{x) QQQQT 

The dominant term in (17) is also expressed as Dy{qy-,py{0y)) = Ly{qy-,py{0y)) — Ly{qy) using 
the cross-entropy. 
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5. Information criteria 

Let us define an information criterion as an estimator of riska;;^. 

risk^.j^ = Ly{qy]Py{ey)) - Ly{qy) + ^ tl{GyHy^) + ^ tl {Hy^ G yHy^) , (iS) 

where the matrices Gy, Hy and may be replaced by their consistent estimators with error 
When x ^ y, (18) reduces to 

riskj/;j, — Lyi^qy, Pyi^Oy)) “ L yi^^y) + — tl {G y Hy ), (l9) 

which corresponds to the Takeuchi information criterion (TIC) for estimating riskj^j^ mentioned 
in Burnham and Anderson (2002) and Konishi and Kitagawa (2008). In model selection, we 
ignore Ly{qy), because all candidate models have the same value. The first term Ly{qy]Py{6y)) = 
—ly{6y)/n of order Op(l) measures the goodness of fit, while the last two terms of order 0{n~^) 
are interpreted as the penalty of model complexity. Our estimator is justihed by the following 
theorem. 

Theorem 1. Assume the regularity conditions of White (1982) mentioned in Section 2, and 
also assume that (15) holds. Then we have 

Ly{qy;Py{0y)) = E {Ly{qy; Py{0y))} + ^tiiGyHf^) + (20) 

and therefore 

E{nsk^.y} = riska;;^ + (21) 

Thus, the estimator is unbiased asymptotically up to terms of order 0{n~^). 

In the case of the correct specification where qy = Py{0y) for the incomplete-data distribution, 
we have Gy = Hy = Iy{6y), and the information matrix is consistently estimated by Iy{6y). 
Assuming (15), this implies that = PxiOx) is correctly specified for the complete-data distri¬ 
bution. Hence, Hx = Ixi^y) is consistently estimated by lx{0y). For model selection, we assume 
that Py{6) is misspecihed for qy in general. However, these equations may approximately hold 
if Py{6y) is a good approximation of qy. By substituting Gy ^ Hy ^ lyi^y) and Hx ~ hx{0y) 
into (18) and (19), we have 

riskj-.j^ ~ Lyify, Py{dy)) ~ ^yiSy) T T 

and ^ 

riskj^.y Ly{qy;py{ey)) - Ly{qy) + -, 

where Ly{qy) is ignored for model selection. Multiplying by 2n converts these approximations 
to AlCx-,y and AIC, respectively. 
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6. PDIO and AICcd 

The idea behind the derivation of PDIO and AICcd is to replace by 

Qx Pz\yi.^y')Qy (2^) 

This implies (14) by considering the limiting situation of n —)■ cx). Thus, the assumption 
for PDIO and AICcd is stronger than the assumption for AlCa;;^. Substituting (22) into the 
complete-data MLE gives 

0^ = aigmin (23) 

Comparing (23) with (11) gives 0x = 6y Therefore, there should not be any missing data, or at 
least Pz\y{0) should not involve the parameter 6. Consequently, AIC, PDIO, AICcd, and AlCa;;^ 
are equivalent when PDIO and AICcd are justihed under (22). 

Although assumption (22) is too strong to work with, it is interesting to see how PDIO 
and AICcd would be derived if (22) is formally accepted. The argument below to derive PDIO 
and AICcd is rather confusing because (jx is interpreted interchangeably as the complete-data 
empirical distribution or the right-hand side of (22). 

By a similar argument to the proof of Theorem 1, the Taylor expansion of Lx{qx]Px{0)) 
around 6 = 0y is 

Lx{qx;Px{0)) = Lx{qx;Px{0y)) + ^{6 - 6y)^Hx{0 - Oy) + (24) 

with Hx = Hx{qx'i 0y)- Its expectation with 0 = Oy gives 

Lx{qx]Px{0y)) = E{Lx{qx]Px{ey))] + ^tT{HxH-^GyH-^) + (25) 

This corresponds to (20) of Theorem 1. Noticing (16) and thus, Dy{qy]Py{0y)) = Lx{qx',Px{0y)) — 
Lx{qx), and then substituting (25) into (17) gives the estimator of riska;;^ unbiased up to 0{n~^) 
under (22) as 

riska,.j^ = Lx{qx;Px{0y)) - Lx{qx) + ^tT{HxH~^GyH~^). (26) 

The goodness of fit term is Lx{pz\y{0y)qy',Px{^y)) = —Q{0y',0y)/n under (22). Therefore, (26) 
gives AICcd by the same approximation used to derive AlCa,;^. 

In Cavanaugh and Shumway (1998), for evaluating (3.15) there, they assumed that 
E{Q{eQ]ey)} ^ E{Q(0o;^o)} or E{Lx{pz\y{Oy)qy]Px{0o))) ~ Lx{pz\y{Oo)qy;Px{0o)) under 
the correct specihcation qx = Px{0o). The equality holds exactly under (22) because 
E{Lx{qx',Px{0o))) = Lx{qx',Pxi^o)) if Qx is interpreted as the empirical distribution. Unfor¬ 
tunately, the difference is E{Q{0q-, Oy)} — E{Q{0o;0o)} = 0(1) in general without assuming 
(22), leading to the bias of AICcd even when (15) holds. 
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In Shimodaira (1994), (3.5) corresponds to our (24), where 0^ = Oy is assumed implicitly in 
order to ignore the hrst derivative. Although Lx{qx) diverges for continuous random variable x, 
Dx{,qx]Vx{Gy)) = Lx{qx]Vx{Gy)) — Lx{qx) is formally considered. Similar to (16), we then have 
Dx{qx]Vx{Gy)) = Dy{qy]Py{6y)) lu (3.6) there. From this argument, the goodness of fit term of 
(26) is Lx{qx]Px{Oy)) = Ly{qy]Py{0y)) + Lx{qx) - Ly{qy), where Lx{qx) - Ly{qy) is independent 
of the model specification if qx is interpreted as the empirical distribution. Therefore, (26) gives 
PDIO because Lx{qx',Px{0y)) can be replaced with Ly{qy]Py{6y)) for model selection. 

7. Simulation study 
7 . 1. Simulation 1 

To verify Theorem 1, we performed a simulation study of the two-component normal mixture 
model dehned as follows. Let G {1,2} be a discrete random variable for the component 
label, and y G M be a continuous random variable for the observation. The distribution of 2 ; 
is P{z = i) = Tti and the conditional distribution of y given z = i is the normal distribution 
with mean /ij and variance af. The true parameter for data generation is specified as Oq = 
(vTi,/xi,/i 2 , cr^, cr^) = (0.6,—1,1, 0.7^, 0.7^). We consider two candidate models for selection. 
Model 1 is a two-component normal mixture model with a constraint af = cr^ (d = 4), whereas 
Model 2 is the same model without the constraint (d = 5). Because these two models are 
correctly specified, (15) holds. However, (22) obviously does not. 

We generated B = 4000 datasets with sample size n = 100, 200, 500,1000, 2000, 5000,10000. 
They are denoted as = {xf \ ... b = 1,...,H. We also generated datasets of 

sample size n = 15000, which are denoted as = {x^^\ ... for computing the loss 

functions. For each X^^'> = and Model k, k = 1,2 , we computed the informa¬ 

tion criteria AIC(y*^^\ k), PDIOCF*^^^ k), AICcd(H*^^\ k), X[Cx-,y{y^^\ k), and the loss functions 
losSy;j^(y'(^), k) = Ly{qy;py{eil’^)), lo ss^; y (X ^ , fc) = L x{q x] P xi^^)), where O^y'^ is computed from 
In the formulas below, denotes that the expectation on the left-hand side is com¬ 
puted numerically by the simulation on the right-hand side. The loss functions are computed 
numerically by 

\osSy,y{Y^''\k) - ^^\ogpy{yf^-e^^^), 

^ t=i 
1 ” 

\0SSx,y{X^^\k) --^\ogPx{xf^]6^y'^), 

t=i 

where px, Py, and Oy^'^ are for Model k. Then the expectation with respect to qx = Px{0o) is 
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Table 1 

Expected values of the information criteria and the risk functions in Simulation 1. These values are differences 

between the two models with standard errors in parentheses. 


n 

too 

200 

500 

1000 

2000 

5000 

10000 

E{AAIC) 

0.810 

(0.027) 

0.898 

(0.025) 

0.982 

(0.023) 

0.978 

(0.023) 

0.986 

(0.023) 

0.982 

(0.023) 

1.04 

(0.022) 

£;(APDIO) 

43.5 

(1.64) 

41.1 

(0.716) 

37.0 

(0.344) 

36.0 

(0.220) 

34.9 

(0.141) 

34.4 

(0.088) 

34.2 

(0.064) 

i;(AAICed) 

42.3 

(1.67) 

41.0 

(0.793) 

37.2 

(0.518) 

36.6 

(0.494) 

35.2 

(0.573) 

35.5 

(0.812) 

33.5 

(1.08) 

E{AAlC,.y) 

22.1 

(0.821) 

21.0 

(0.361) 

19.0 

(0.174) 

18.5 

(0.113) 

18.0 

(0.074) 

17.7 

(0.049) 

17.6 

(0.037) 

2nArisky;y 

1.83 

(0.052) 

1.47 

(0.040) 

1.15 

(0.030) 

1.08 

(0.027) 

1.03 

(0.026) 

1.02 

(0.030) 

0.967 

(0.033) 

2nArisk3;.y 

100.9 

(40.3) 

28.9 

(1.39) 

20.3 

(0.620) 

18.6 

(0.487) 

18.2 

(0.456) 

17.5 

(0.464) 

17.0 

(0.430) 


computed by the simulation average. For example, 

1 ^ 

f;(aaic) 1 ) - aic(f('’\ 2 )), 

b=l 

1 ^ 

Ariska,;y — ^(losSa;;j^(X^^\ 1) - \osSx-y{X^^\ 2)). 
b=i 

This Monte Carlo method calculates the expectation accurately for sufficiently large n and B. 

The result shown in Table 1 verifies Theorem 1. For sufficiently large n, i?(AAIC) = 
2nArisky;j^ and i?(AAICa;;y) = 2nAriska;;j, hold very well. On the other hand, i?(APDIO) differs 
significantly from 2nAriskj^;y and 2nAriska;;j^. Thus, PDIO is not a good estimator of either of 
these risk functions. In addition, the expected value of AICcd is similar to that of PDIO, but 
its variation is larger than PDIO, as seen in the standard errors. 

Let us consider the difference PDIO — AICcd 

n 

diff(F, Oy) = 2Q{0y; Oy) - 2iy{ey) = 2 

t=l 

and its difference between the two models, which is denoted as Adiff(D, 6y) = APDIO—AAICcd. 
Adiff (F, 6y) and F^(Adiff(F, 0y)) can be very large, and they are 0{n) under model misspec- 
ihcation. If (15) holds, as is the case of Table 1, E{diS(Y,6y)) = 2n J q^i^x) log qz\y{z\y) dx is 
independent of the model. Therefore, the difference becomes smaller; Adiff(F, 0^,) = Opiy/d) 
and E{AdiS{Y,6y)) = 0(1). 


jPz\y{z\yt, Oy) \ogp^\y{z\yp Oy) dz, 
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7 . 2. Simulation 2 

We next performed a simulation study on the three-component normal mixture model to 
examine how well the information criteria work for model selection in a practical situation 
where some candidate models do not satisfy assumption (15). The true parameter value is 
6q = {711,712, Ri, R 2 , R 3 ,o'f, 02,(7^) = (0.5, 0.3, —2,0, 3, 0.7^, 0.7^, 1^). We consider hve candi¬ 
dates with the following constraints. Model 1 is erf = ct| = cr| {d = 6). Model 2 is erf = er| 
{d = 7). Model 3 is erf = er| {d = 7). Model 4 is erf = erf {d = 7), and Model 5 has no constraint 
{d = 8 ). Model 1, Model 2, and Model 3 are misspecihed and do not satisfy (15). Model 4 
and Model 5 are correctly specihed and satisfy (15). None of the models satisfy (22). We have 
generated B = 10000 datasets of n = 500 and n = 2000. 

Table 2 shows the model selection results. Model 4 is the best model in the sense that it 
minimizes both risk^.^^ and risk^..^^ (Table 3). All the information criteria tend to select Model 4. 
AIC tends to choose a more complex model (i.e.. Model 2 or Model 5) than the other criteria, 
indicating a smaller penalty for model complexity. PDIO tends to choose a simpler model (i.e.. 
Model 1), implying a larger penalty for model complexity. 

To compare candidate models in the long run, the expected loss of each Model k relative to 
that of Model 4 is computed by 

1 ^ 

Ariska,;y(/c) —^{loss,,.y{X'^’’\k) - \oss,^.y{X^^\4)). 
b=l 

Table 3 (upper) shows the results. The most complex model (Model 5) is the second best 
in terms of riskj^.^, but the simplest model (Model 1) is the second best in terms of risk^-.y, 
indicating a large contribution of Pz\y{0) to the second term of (17). 

The information criterion performance is measured by the expected loss of the selected model. 
For example, the performance of AIC in terms of complete data is measured by 

1 ^ 

Ariskj,;j/(AIC) —^{\oss,,.y{X^^\k^’’^) - \oss,,-y{X^^\4:)), 

b=l 

where is the minimum AIC model computed from Table 3 (lower) shows the results, 
where the value in bold denotes the minimum value of each column. AIC outperforms the other 
criteria in terms of riskj^j^, and AICa;;j^ outperforms the other criteria in terms of riska;;^. In this 
example, some models do not satisfy assumption (15), but AIC and XlCx-,y work very well as 
expected. 

8. Concluding remarks 

We derived AlCa;;^ as an unbiased estimator of the expected Kullback-Leibler divergence be¬ 
tween the true distribution and the estimated distribution of complete data when only incom- 
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Table 2 

Frequency of model selection in Simulation 2. 



Model I 
(d = 6) 

Model 2 
(d = 7) 

Model 3 
(d=7) 

Model 4* 
(d = 7) 

Model 5* 
id = 8) 

AIC 

881 

2419 

262 

5600 

838 

PDIO 

5442 

16 

4 

4534 

4 

AlCed 

2063 

2 

974 

6551 

410 

AlCa;;j, 

3704 

65 

15 

6190 

26 


* correctly specified model 


Table 3 

Risk functions for models and those for information criteria in Simulation 2. These values are relative to 

Model 4 with standard errors in parentheses. 



2nArisky;y 

2nAriska;;y 

Model 1 

6.60 

(0.04) 

33.2 

(0.21) 

Model 2 

1.40 

0-02) 

59.2 

(0.71) 

Model 3 

7.86 

(0.04) 

80.7 

(0.80) 

Model 4* 

0 

(0.00) 

0 

(0.00) 

Model 5* 

1.32 

(0.02) 

45.6 

(0.87) 

AIC 

1.44 

(0.03) 

39.6 

(0.91) 

PDIO 

3.57 

(0.04) 

19.6 

(0.30) 

AlCed 

2.33 

(0.04) 

28.2 

(0.72) 

AlCx-y 

2.36 

(0.04) 

14.8 

(0.43) 


* correctly specified model 


plete data is available. In Simulation 1, AlCa;;^ and AIC are unbiased up to the penalty terms, 
whereas PDIO and AICcd are not. 

To derive AlCa;;^, we assumed (15), meaning that the conditional distribution Pz\yiG) of the 
missing data given the incomplete data is correctly specihed, while the marginal distribution 
Py{0) of the incomplete data is misspecihed in general. However, the conditional distribution 
is misspecihed in practice. In Simulation 2, we observed that AlCa;;^ and AIC perform better 
than the other criteria even if some models are misspecihed. Without assumption (15), the 
dominant term in (17) is D^{q:c]Px{Oy)) = T>x(gx;Pz|y(0j;)%) + Dy{qy]Py{0y)) > Dy{qy]Py{0y)). 
Thus, AICa;;j^ estimates the lower bound of 2?7,riska,.2^. It is impossible to reasonably estimate 
the ignored term Dx{qx]Pz\y{Sy)qy) in our setting where 2 ^ 1 ,..., ^„ are missing completely. 

Although we assume that pz\y{0) is correctly specihed, it is benehcial to include Pz\y{0) as a 
part of PxiO) = Pz\y{G)py{0) for model selection. The variance of Oy causes Pz\y{Gy) to huctuate 
even if Pz\y{0y) = (lz\y The amount of this random variation is measured by the additional 
penalty term (1) in AlCa;;^^. 

In the future, we plan to work on more complicated missing mechanisms or combine a missing 
mechanism with other sampling mechanisms, such as the covariate-shift (Shimodaira, 2000) 
problem. One important extension is semi-supervised learning (Chapelle, Scholkopf and Zien, 
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2006; Kawakita and Takeuchi, 2014), where the log-likelihood function is 

n n-\-n' 

\ogpy{yt]6) + ^ logp^{xt]6). 
t=l t=n~\-l 

In this case, the additional complete data Xn+i, ■ ■ ■, Xn+n' helps estimate conditional distribu¬ 
tion qz\y. We may reasonably estimate Dx{qx]Pz\y{Sy)qy) without assuming (15), leading to a 
new information criterion, which will be the subject in future research. 
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Appendix A: Technical details 
A.l. Proof of Lemma 1 

For brevity, we omit {y,z) of fx{y,z) in the integrals below. Dx,{gx]fx) = f f 9z\ygy{^oggz\y 
\oggy - log f^\y - log fy)dzdy = j gy j g,\y{logg,\y - log f^\y)dzdy + j gy{j g^\ydz){loggy - 
log fy)dy = J gy J g^iy{logg^iygy-log f^iygy)dzdy +J gy{loggy-log fy)dy = Dx{gz\ygy] fz\ygy) + 
Dyigy-Jy), thus showlug (6). Dy{gy-Jy) = J J h^iygy{loggy - logfy + logh^iy - loghz\y)dzdy = 
Dx{hz\ygy]hz:\yfy), wlilch shows (7). 


A. 2. Proof of Lemma 2 

We assume qz\y = Pz\y{Gy) and Ox = Oy. From the dehnitions of O^ and Hx, we have 


dDx{qx;Px{0)) 


dO 


= 0 , 


d^Dx:{qx;Px{0)) 


dOdO^ 


= H^. 


Hence, the Taylor expansion of Dx{qx]Px{0)) around 6 = Oy is 


for 6 — 6,1 = 


Dx{qx-,Px{0)) = Dx{qx-,Px{0y)) + ^{6 - Oy^Hx,{6 - 6y) + 0 {n ^/^) 

0(n“^/^). The first term on the right-hand side is Dy{qy]Py{ 6 y)) as shown in (16). 


Substituting 6 = Oyin. Dx{qx]Px{0)) and taking its expectation gives (17) by noting 
^ I ~ ^y) Hx{0y — Oy) I = tr (^Hx E | {6y — 0y){6y — 0y 
which becomes tr [HxHf^GyHf^) /n -|- 0{n~‘^) from (4). 
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A.3. Proof of Theorem 1 

From the definitions of 6 y and Hy = Hy{qy; 6y), we have 

dLy{qy]Py{0)) ^ 8 ^ Lyi^y] PyjO)) 

do Oy ’ aeae^ 

Hence, the Taylor expansion of Ly{qy]Py{6)) aronnd 0 = 6y is 

Ly{qy;Py{0)) = Ly{qy; Py{0y)) + ^{0 - 0yfHy{0 - 0y) + 

for 0 — 0y = Op{n~^/‘^). Snbstitnting 0 = 0y in Ly{qy;py{0)), we take its expectation below. By 

noting Hy = Hy + we have 

B {(e„ - - 4)} = tr (h, b {(«, - 0„)(e„ - + o{n-r\ 

which becomes ii{HyHf^GyHf^)/n + from (4). This proves (20) becanse 

E{Ly{qy]Py{0y))} = E {L y{qy] P y{0 y))} T ^ t r ( Gy if " ^ ) T O ( TT " ^ ^ ^ , 

and E{Ly{qy]Py{0y))} = Ly{qy]Py{0y)). Substitntmg (20) into (17) and comparing it with (18) 

yields (21). 
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